Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that focus solely on lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
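As an illustration of the word-masking idea, here is a minimal sketch, assuming word-level audio alignments are available; the function names and masking scheme are illustrative, not the paper's implementation:

```python
import numpy as np

def mask_word_segments(spectrogram, word_alignments, mask_prob=0.3, rng=None):
    """Zero out the spectrogram frames of randomly chosen words.

    spectrogram:      (num_frames, num_mel_bins) array.
    word_alignments:  list of (start_frame, end_frame) per word.
    mask_prob:        fraction of words whose audio is hidden, forcing
                      the model to lean on the visual stream for them.
    """
    rng = rng or np.random.default_rng()
    masked = spectrogram.copy()
    for start, end in word_alignments:
        if rng.random() < mask_prob:
            masked[start:end] = 0.0  # replace this word's audio with silence
    return masked

# Example: a 100-frame, 80-bin spectrogram with three aligned words.
spec = np.random.randn(100, 80)
words = [(0, 30), (30, 65), (65, 100)]
masked_spec = mask_word_segments(spec, words, mask_prob=0.5)
```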
Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, since the speech is used to supervise the pre-training, it is never seen by the video encoder, which therefore never learns to process that modality. We address this shortcoming of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach on video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
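A toy sketch of the modality-masking objective, with an assumed architecture and an MSE reconstruction loss standing in for the paper's actual model:

```python
import torch
import torch.nn as nn

class ModalityMaskingPretrainer(nn.Module):
    """Toy encoder: fuse three modality features, reconstruct a masked one."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.reconstruct = nn.Linear(dim, dim)

    def forward(self, appearance, audio, speech, masked="speech"):
        feats = {"appearance": appearance, "audio": audio, "speech": speech}
        target = feats[masked]
        feats[masked] = torch.zeros_like(target)     # hide one whole modality
        fused = self.fuse(torch.stack(list(feats.values()), dim=1))
        pred = self.reconstruct(fused.mean(dim=1))   # predict it from the rest
        return nn.functional.mse_loss(pred, target)

model = ModalityMaskingPretrainer()
x = lambda: torch.randn(8, 256)                      # batch of 8 clip features
loss = model(x(), x(), x(), masked="audio")
loss.backward()
```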
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
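For concreteness, a generic sketch of two of the practices surveyed, patch-based training and k-fold cross-validation, using scikit-learn; the helper function and toy dataset are hypothetical and not tied to any challenge solution:

```python
import numpy as np
from sklearn.model_selection import KFold

def extract_patches(image, patch=64, stride=64):
    """Tile a large 2D image into fixed-size patches (patch-based training)."""
    h, w = image.shape
    return [image[y:y + patch, x:x + patch]
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

images = [np.random.rand(256, 256) for _ in range(10)]  # stand-in dataset
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(images)):
    train_patches = [p for i in train_idx for p in extract_patches(images[i])]
    # ... train one model per fold on train_patches, validate on val_idx ...
    print(f"fold {fold}: {len(train_patches)} training patches")
```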
We propose a fully unsupervised method to detect bias in contextualized embeddings. The method leverages the assortative information latently encoded by social networks and combines orthogonality regularization, structured sparsity learning, and graph neural networks to find the embedding subspace capturing this information. As a concrete example, we focus on the phenomenon of ideological bias: we introduce the concept of an ideological subspace, show how it can be found by applying our method to online discussion forums, and present techniques to probe it. Our experiments suggest that the ideological subspace encodes abstract evaluative semantics and reflects changes in the political left-right spectrum during the presidency of Donald Trump.
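One ingredient mentioned above, orthogonality regularization on a learned projection, can be sketched as follows (illustrative only; the full method also combines structured sparsity learning and graph neural networks):

```python
import torch

def orthogonality_penalty(W):
    """Encourage the rows of projection W (k x d) to be orthonormal,
    so the learned subspace has non-redundant directions."""
    k = W.shape[0]
    gram = W @ W.T                          # (k, k) pairwise dot products
    return ((gram - torch.eye(k)) ** 2).sum()

# Project contextualized embeddings into a k-dim candidate bias subspace.
d, k = 768, 8
W = torch.nn.Parameter(torch.randn(k, d) * 0.02)
emb = torch.randn(32, d)                    # batch of embeddings
subspace_coords = emb @ W.T                 # (32, k) coordinates in the subspace
loss = orthogonality_penalty(W)             # added to the main training loss
loss.backward()
```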
Classifying forecasting methods as being either of a "machine learning" or "statistical" nature has become commonplace in parts of the forecasting literature and community, as exemplified by the M4 competition and the conclusions drawn by its organizers. We argue that this distinction does not stem from fundamental differences in the methods assigned to either class. Instead, the distinction is probably of a tribal nature, which limits the insights into the appropriateness and effectiveness of different forecasting methods. We provide alternative characteristics of forecasting methods which, in our view, allow one to draw meaningful conclusions. Further, we discuss areas of forecasting which could benefit most from cross-pollination between the ML and statistics communities.
Incorporating prior knowledge of physics laws and structural properties of dynamical systems into the design of deep learning architectures has proven to be a powerful technique for improving their computational efficiency and generalization capacity. Learning accurate models of robot dynamics is critical for safe and stable control. Autonomous mobile robots, including wheeled, aerial, and underwater vehicles, can be modeled as controlled Lagrangian or Hamiltonian rigid-body systems evolving on matrix Lie groups. In this paper, we introduce a new structure-preserving deep learning architecture, the Lie group Forced Variational Integrator Network (LieFVIN), capable of learning controlled Lagrangian or Hamiltonian dynamics on Lie groups, either from position-velocity or position-only data. By design, LieFVINs preserve both the Lie group structure on which the dynamics evolve and the symplectic structure underlying the Hamiltonian or Lagrangian systems of interest. The proposed architecture learns surrogate discrete-time flow maps instead of surrogate vector fields, which allows better and faster prediction without requiring the use of a numerical integrator, neural ODE, or adjoint techniques. Furthermore, the learnt discrete-time dynamics can be combined seamlessly with computationally scalable discrete-time (optimal) control strategies.
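A heavily simplified sketch of the flow-map idea, deliberately ignoring the Lie group and symplectic structure that LieFVIN is designed to preserve: learn the discrete one-step map directly, so rollout is a forward pass rather than a numerical integration. All names and dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class DiscreteFlowMap(nn.Module):
    """Learn (state_k, control_k) -> state_{k+1} directly.
    Unlike a learned vector field, rollout needs no ODE solver."""
    def __init__(self, state_dim=6, control_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + control_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, control):
        # Residual form: predict the state increment over one time step.
        return state + self.net(torch.cat([state, control], dim=-1))

model = DiscreteFlowMap()
state, control = torch.randn(16, 6), torch.randn(16, 2)
next_state = model(state, control)                   # one-step prediction
loss = nn.functional.mse_loss(next_state, torch.randn(16, 6))  # vs. data
loss.backward()
```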
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
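Since the models are publicly released, a minimal usage sketch with the Hugging Face transformers library; the smaller bloom-560m checkpoint from the same release is used here because the full 176B model requires multi-GPU hardware:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Greedy continuation of a short prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```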
Denoising diffusions are state-of-the-art generative models which exhibit remarkable empirical performance and come with theoretical guarantees. The core idea of these models is to progressively transform the empirical data distribution into a simple Gaussian distribution by adding noise using a diffusion. We obtain new samples whose distribution is close to the data distribution by simulating a "denoising" diffusion approximating the time reversal of this "noising" diffusion. This denoising diffusion relies on approximations of the logarithmic derivatives of the noised data densities, known as scores, obtained using score matching. Such models can be easily extended to perform approximate posterior simulation in high-dimensional scenarios where one can only sample from the prior and simulate synthetic observations from the likelihood. These methods have been primarily developed for data on $\mathbb{R}^d$ while extensions to more general spaces have been developed on a case-by-case basis. We propose here a general framework which not only unifies and generalizes this approach to a wide class of spaces but also leads to an original extension of score matching. We illustrate the resulting class of denoising Markov models on various applications.
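For reference, the standard VP-SDE formulation of the noising/denoising pair and the score matching objective described above (textbook notation from the score-based diffusion literature, not necessarily this paper's):

```latex
\text{Forward (noising):}\quad
dX_t = -\tfrac{1}{2}\beta(t)\,X_t\,dt + \sqrt{\beta(t)}\,dW_t,
\qquad X_0 \sim p_{\mathrm{data}},\; t \in [0, T].

\text{Reverse (denoising), integrated backward from } t = T:\quad
dX_t = \Big[-\tfrac{1}{2}\beta(t)\,X_t - \beta(t)\,\nabla_x \log p_t(X_t)\Big]\,dt
       + \sqrt{\beta(t)}\,d\bar{W}_t.

\text{Denoising score matching:}\quad
\min_{\theta}\; \mathbb{E}_{t,\,x_0,\,x_t \mid x_0}\,\lambda(t)\,
\big\lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\rVert^2 .
```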
The heterogeneity of hardware and data is a well-known and well-studied problem in the Federated Learning (FL) community. Recently, training custom-sized client models with Knowledge Distillation (KD) has emerged as a viable strategy for tackling the heterogeneity challenge. However, previous efforts in this direction have focused on client model tuning rather than on its impact on the knowledge aggregation of the global model. Although the performance of the global model is the primary objective of FL systems, client models have received more attention under heterogeneous settings. Here, we provide more insights into how the chosen approach for training custom client models affects the global model, which is essential for any FL application. We show that the global model can fully leverage the strength of KD with heterogeneous data. Driven by empirical observations, we further propose a new approach that combines KD and Learning without Forgetting (LwoF) to produce improved personalised models. We bring heterogeneous FL on par with the mighty FedAvg of homogeneous FL in realistic deployment scenarios with dropping clients.
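A generic sketch of the knowledge-distillation loss underlying this line of work, in its standard Hinton-style form rather than the paper's exact recipe: a small client model matches the softened logits of a teacher.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD: blend hard-label cross-entropy with a soft-label
    KL term computed at temperature T (Hinton et al., 2015)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)           # from the (global) teacher
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```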
Score-based generative modelling (SGM) has proven to be a very effective method for modelling densities on finite-dimensional spaces. In this work, we propose to extend this methodology to learn generative models over functional spaces. To do so, we represent functional data in spectral space, which dissociates the stochastic part of the processes from their space-time part. Using dimensionality reduction techniques, we then sample from their stochastic component with finite-dimensional SGM. We demonstrate the effectiveness of our method for modelling various multimodal datasets.
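A rough sketch of the described pipeline under assumed details (a truncated Fourier basis stands in for the spectral representation; the paper's basis and sampler may differ):

```python
import numpy as np

def to_spectral(functions, n_coeffs=16):
    """Project sampled 1D functions onto a truncated Fourier basis.
    functions: (num_samples, num_grid_points) real array."""
    coeffs = np.fft.rfft(functions, axis=1)       # spectral representation
    return coeffs[:, :n_coeffs]                   # dimensionality reduction

def from_spectral(coeffs, num_grid_points):
    """Reconstruct functions on the grid from truncated coefficients."""
    full = np.zeros((coeffs.shape[0], num_grid_points // 2 + 1), dtype=complex)
    full[:, :coeffs.shape[1]] = coeffs
    return np.fft.irfft(full, n=num_grid_points, axis=1)

grid = np.linspace(0, 1, 128)
data = np.sin(2 * np.pi * np.random.rand(64, 1) * grid)  # toy function samples
z = to_spectral(data)              # (64, 16) complex coefficients
# ... train a finite-dimensional score-based model on (Re z, Im z) here ...
recon = from_spectral(z, grid.size)
```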